We present a systematic approach to the prediction of soccer matches. First, we show that
information about chances for goals is far more informative than information about the actual results. Second,
we present a multivariate regression approach and show how the prediction quality increases with 
increasing information content. This prediction quality can be explicitly expressed in terms of just 
two parameters. Third, by disentangling the systematic and random components of soccer matches 
we can identify the optimum level of predictability. These concepts are exemplified for the German 
Bundesliga. 




I. INTRODUCTION 

One important field is the prediction of soccer matches.
In the literature different approaches can be found. In one
type of model [1–10] appropriate parameters are introduced
to characterize the properties of individual teams,
such as the offensive strength. Of course, the characterization
of team strengths is not restricted to soccer;
see, e.g., [5]. The specific values of these parameters can
be obtained via Monte-Carlo techniques. These models 
can then be used for prediction purposes and allow one to 
calculate probabilities for individual match results. A key 
element of these approaches is the Poissonian nature of 
scoring goals [H 13 E]. Beyond these goals-based prediction
approaches, results-based models are also used. Here
the final result (home win, draw, away win) is predicted 
from comparison of the difference of the team strength 
parameters with some fixed values [12]. The quality of
both approaches has been compared and no significant
differences have been found [13]. Going beyond these
approaches additional covariates can be included. For
example, home and away strengths are considered individually
or the geographical distance is taken into account [13].
Recently, ELO-based ratings have also
been used for the purpose of forecasting soccer matches.

Recent studies suggest that statistical models are superior
to lay and expert predictions but have less predictive
power than the bookmaker odds [14–17]. This observation
strongly suggests that either the information used
by the bookmakers is more powerful or, alternatively,
the inference process, based on the same information, is
more efficient. Probably both aspects play a role.

When predicting soccer matches, several key aspects
have to be taken into account: (i) choice of appropriate
observables which contain optimum information about
the individual team strengths, (ii) definition and subsequent
estimation of the team strength, (iii) estimation of
the outcome of a soccer match based on the two team
strengths, (iv) additional consideration of the stochastic
(Poissonian) contributions to a soccer match. The final
two aspects have been analyzed in detail in Ref. [18].



In the present work we concentrate on the first two
aspects. We therefore restrict ourselves to predicting
the outcome of the second half of the season, i.e.
the sum over the final 17 matches (in the German Bundesliga).
For this target the stochastic aspects are
somewhat easier to handle than for the prediction of a
single match, so that we can concentrate on (i) and (ii).
However, all concepts can also be applied directly to the
prediction of single soccer matches. Furthermore, our
analysis can naturally be transferred to all other soccer
leagues. As a key result we identify the level of optimum
predictability and determine how close our actual
inference approaches this optimum level.

It will turn out that the chances for goals are highly 
informative. They are provided by a professional sports 
journal (www.kicker.de) since the season 1995/96. In to- 
tal we take into account all seasons until 2010/11. Since 
the definition of the chances for goals has slightly changed 
during the first years of the reporting period we have nor- 
malized the chances for goals such that their total number 
is identical in every season. 



II. KEY ELEMENTS OF THE PREDICTION 
PROCESS 

A. Systematic and stochastic effects in soccer 
matches 

Our general goal is the prediction of the future results 
of soccer matches. More specifically, we concentrate on 
the prediction of the outcome of the second half of the 
league tournament (German Bundesliga). This second 
half involves $N_2 = 17$ matches. We want to predict the
final goal difference $\Delta G_2$ of each team after these $N_2$
matches. A similar analysis could also be performed for
points. We mention in passing that the information content
of the goal difference about the team strength is
somewhat superior to that of points [19].

In previous work we have defined the team strength $S_2$
of a team as the expected average goal difference when
playing against all other 17 teams. Strictly speaking, $S_2$
could only be determined exactly if this team played very often
against the other 17 teams under identical conditions.

Let $\Delta G_2(N_2)$ denote the goal difference of some team
after $N_2$ matches in the second half, normalized per
match. Then $\Delta G_2(N_2)$ can be expressed as the sum of its
strength $S_2$ and a random variable $\xi$, which denotes the
non-predictable contributions in the considered matches.
In what follows we assume that the variance of $\xi$ is not
correlated with the strength $S_2$. Taking into account
that the random contributions during different matches
are uncorrelated, one immediately obtains



$$\mathrm{Var}(\Delta G_2(N_2)) = \mathrm{Var}(S_2) + \frac{V_2}{N_2} \tag{1}$$



where $V_2$ describes the variance of the random contribution
during a single match and $\mathrm{Var}(S_2)$ reflects the variance
of the distribution of team strengths in the league
[18]. The $1/N_2$ scaling simply expresses that the statistical
effects average out when taking into account a larger
number of matches. This scaling only breaks down for
$N_2$ close to unity because then the goal difference also
depends on the strength of the opponent. In practice it
turns out that for $N_2 > 4$ the differences between the $N_2$
opponents have sufficiently averaged out. This dependence
on the number of considered matches has been explicitly
analyzed in Ref. [20]. For the present set of data we
obtain $\mathrm{Var}(S_2) = 0.21$ and $V_2 = 2.95$. Actually, $V_2$ is
very close to the total number of goals per match (2.85).
This is what one expects for a Poissonian process, for which
the variance of the difference of two independent goal counts
equals the sum of their means.
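
As a rough numerical illustration of Eq. (1), the following sketch evaluates the variance for the values quoted above and shows how $\mathrm{Var}(S_2)$ and $V_2$ could be recovered from a straight-line fit in $1/N_2$. The function names and the noise-free synthetic input are illustrative assumptions and not part of the original analysis.

```python
def observed_variance(var_s, v, n):
    """Eq. (1): variance of the per-match goal difference after n matches."""
    return var_s + v / n


def fit_variance_components(ns, variances):
    """Least-squares line variances = var_s + v * (1/n); returns (var_s, v)."""
    xs = [1.0 / n for n in ns]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(variances) / len(variances)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, variances))
    v = sxy / sxx
    var_s = y_mean - v * x_mean
    return var_s, v


# Values quoted in the text for the Bundesliga data.
VAR_S2, V2 = 0.21, 2.95

# Synthetic "measurements" generated from Eq. (1) itself (assumption: no noise).
ns = list(range(5, 18))
variances = [observed_variance(VAR_S2, V2, n) for n in ns]

var_s_fit, v_fit = fit_variance_components(ns, variances)
total_17 = observed_variance(VAR_S2, V2, 17)
print(round(total_17, 3), round(var_s_fit, 3), round(v_fit, 3))
```

With real data the input variances would carry statistical errors, so the fitted intercept and slope would only approximate $\mathrm{Var}(S_2)$ and $V_2$.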



B. Prediction within one season 

In an initial step we use information from the first half
of the season to predict the second half. The independent
variable in the first half is denoted as $Y$, the dependent
variable in the second half as $Z$. As the simplest
approach we formulate the linear regression problem
$Z = bY$. In what follows all variables fulfill the
condition that the first moment of the variable, if averaged
over all teams, is strictly zero. Generalization is, of
course, straightforward. The regression problem requires
the minimization of $\langle (Z - \hat{Z})^2 \rangle$ with respect to $b$, where
$\hat{Z} = bY$ is the explicit prediction of $Z$. Inserting the
resulting value $b_{\mathrm{opt}}$ yields for this optimum quadratic
variation



$$\chi^2(Y) = \mathrm{Var}(Z)\left[1 - \left[\mathrm{corr}(Y,Z)\right]^2\right] \tag{2}$$



where $\mathrm{Var}(Z)$ denotes the variance of the distribution of
$Z$ and

$$\mathrm{corr}(Y,Z) = \frac{\langle YZ \rangle}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(Z)}} \tag{3}$$

is the Pearson correlation coefficient between the variables
$Y$ and $Z$. This relation has a simple intuitive interpretation:
the higher the correlation between the variables $Y$ and $Z$,
the better the predictability of $Z$ in terms of $Y$.

To be somewhat more general, we consider the case
that exactly $N_1$ ($\le 17$) matches in the first half of the
season have been taken into account to define the independent
variable $Y$. Whenever we want to express the
dependence on $N_1$ we use the notation $Y(N_1)$. Without
this explicit dependence we always refer to $N_1 = 17$.
To reduce the statistical errors we always average over
different random selections of $N_1$ matches from the first
half of the season.



C. Choice of observables

A natural choice for the variable $Y$ is the goal difference
$\Delta G_1$ during the first half. We always assume that
the results have been corrected for the average home advantage
in that season. The quality of the prediction is
captured by $\mathrm{corr}(Y,Z)$; see Eq. (2). From the empirical
data we obtain $\mathrm{corr}(Y = \Delta G_1, Z = \Delta G_2) = 0.56$.

Are there other observables $Y$ which allow one to increase
$\mathrm{corr}(Y, \Delta G_2)$ significantly beyond the value of
0.56? The scoring of goals is the final step in a series
of match events. One may thus expect that there exist
other match characteristics which are more informative
about the team strength. A possible candidate is the
number of chances for goals. We denote the chances for
goals as $C_\pm$ and the goals as $G_\pm$. The sign indicates
whether the quantity refers to the considered team (+) or to the
opponent of that team (-).

In a next step one can define the goal efficiencies $p_\pm$
via the relation

$$G_\pm = C_\pm \cdot p_\pm . \tag{4}$$


Here, $p_+$ denotes the probability that the team is able
to convert a chance for a goal into a real goal and $1 - p_-$
the probability that the team manages not to concede a goal after a
chance for a goal of the opponent. Averaging over all
teams and seasons one obtains $\langle p_\pm \rangle = 0.24$. In analogy to
$\Delta G$ we will mainly consider the difference $\Delta C = C_+ - C_-$
for prediction purposes.

If the goal efficiencies strongly vary from team to team 
in an a priori unknown way the chances for goals contain 
only very little information about the actual number of 
goals. If, however, the goal efficiencies are identical for all 
teams the chances for goals are more informative than the 
goals themselves. In Appendix I this general statement 
is rationalized for a simple model. 

In Fig. 1 the actual goal efficiencies $p_+$ after a season
are shown together with the respective values of $\Delta C$.
Naturally, $\Delta C$ is strongly positively correlated with the
team strength. Two effects are prominent. (1) There is a
slight correlation between $\Delta C$ and $p_+$. On average better
teams have a slightly better efficiency to score goals.
Analogous correlations exist between $p_-$ and $\Delta C$. (2)
The goal efficiencies are widely distributed between approx.
15% and 35%. This observation would indicate
that the information content of the chances for goals
about the resulting team strength, defined in terms of
scoring goals, is quite limited. Surprisingly, this is not
true. For the correlation coefficient $\mathrm{corr}(Y = \Delta C_1, Z = \Delta G_2)$
one obtains a value of 0.65, which is much larger
than $\mathrm{corr}(Y = \Delta G_1, Z = \Delta G_2) = 0.56$.

FIG. 1: The efficiency factors $p_\pm$ as a function of the differences
of the chances for goals $\Delta C$.

To understand this high correlation of the chances for
goals with the team strength we discuss the origin of the
strong fluctuations of $p_\pm$ between the different teams. In
general they are a superposition of two effects: (i) true
differences between teams and (ii) statistical fluctuations,
reflecting the random effects in the 34 soccer matches of
the season. Both effects can be disentangled if one analyses
the $N$-dependence of the variance of $p_\pm$. Whereas
the statistical effects should average out for large $N$, the
systematic effects remain for all $N$. In analogy to Eq. (1)
this can be written as

$$\mathrm{Var}(p_\pm(N)) = \mathrm{Var}(p_\pm) + \mathrm{const}_\pm / N \tag{5}$$

$\mathrm{Var}(p_\pm)$ can be interpreted as the true variance of the
distribution of $p_\pm$, void of any random effects. This $N$-dependence
of $p_+(N)$ is explicitly shown in Fig. 2. One
obtains very small values for $\mathrm{Var}(p_+)$ and
$\mathrm{Var}(p_-)$ ($0.00017 \pm 0.00010$ and $0.00018 \pm 0.00010$, respectively).
Thus, by far the largest contribution to
$\mathrm{Var}(p_\pm(N = 34))$, i.e. to the scatter in Fig. 1, is due to random
effects. Stated differently, beyond the minor correlation
between $p_\pm$ and $\Delta C$, shown in Fig. 1, the efficiency to
score a goal out of a chance for a goal is basically the
same for all teams!
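
The size of the purely random contribution in Eq. (5) can be illustrated with a small simulation in which all teams share the same efficiency $p = 0.24$. This is only a sketch under stated assumptions: every team is given a fixed number of 6 chances per match (a round figure motivated by roughly 1.4 goals per team and match divided by 0.24), and the function name, team count and seed are hypothetical choices.

```python
import random
from statistics import pvariance


def simulated_efficiency_variance(n_matches, n_teams=2000,
                                  chances_per_match=6, p=0.24, seed=1):
    """Variance of observed efficiencies when every team has the same true p."""
    rng = random.Random(seed)
    n_chances = n_matches * chances_per_match
    efficiencies = []
    for _ in range(n_teams):
        # Each chance is converted into a goal with probability p.
        goals = sum(1 for _ in range(n_chances) if rng.random() < p)
        efficiencies.append(goals / n_chances)
    return pvariance(efficiencies)


var_34 = simulated_efficiency_variance(34)
var_8 = simulated_efficiency_variance(8)

# Binomial expectation for the random term const/N at N = 34.
expected_34 = 0.24 * (1 - 0.24) / (34 * 6)
print(var_34, var_8, expected_34)
```

Under these assumptions the random term at $N = 34$ is of the order of $10^{-3}$, several times larger than the quoted systematic variances of about $0.00017$, which is in line with the statement that the scatter in Fig. 1 is dominated by random effects.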

FIG. 2: The variance of the distribution of goal efficiencies
as a function of the number of match days.

To better understand the statistical properties of the
chances for goals we again disentangle the systematic and
random parts by writing

$$\mathrm{Var}(\Delta C_1(N_1)) = \mathrm{Var}(S_1) + \frac{V_1}{N_1}. \tag{6}$$

One obtains $\mathrm{Var}(S_1) = 2.66$ and $V_1 = 14.2$. Based
on this relation it is possible to discuss the individual
contributions to the Pearson correlation coefficient
$\mathrm{corr}(\Delta C_1(N_1), \Delta G_2)$. Using the independence of the
random effects in the first and the second half of the
season one obtains

$$\mathrm{corr}(Y = \Delta C_1(N_1), Z = \Delta G_2) = \frac{\mathrm{corr}(S_1, S_2)}{\sqrt{1 + V_1/(N_1 \mathrm{Var}(S_1))}\,\sqrt{1 + V_2/(17\,\mathrm{Var}(S_2))}}. \tag{7}$$

This expression clearly shows that there are three reasons
why the prediction has intrinsic uncertainties, i.e.
why the correlation coefficient is smaller than unity. First, the
team strength may change in the course of the season, i.e.
$\mathrm{corr}(S_1, S_2) < 1$. Since all other parameters in Eq. (7)
are explicitly known (see above), we can evaluate Eq. (7),
e.g., for $N_1 = 17$. We obtain $\mathrm{corr}(S_1, S_2) = 1.00$. Thus,
the variation of the team strength during a single season
is basically absent; see also Ref. [18]. Second, the
estimation of the team strength in the first half of the
season is hampered by random effects, as expressed by
$V_1/\mathrm{Var}(S_1) > 0$. Of course, the larger the information
content, i.e. the larger $N_1$, the better the prediction. For
the chances for goals this ratio is given by 5.3. If we had
based $Y$ on $\Delta G$ rather than $\Delta C$ we would have obtained
a value of 11.1. This comparison explicitly reveals why
the chances for goals are more informative. Since the ratio
enters Eq. (7) only divided by $N_1$, knowledge
of the chances for goals of 10 matches is as informative
as the goal differences of approx. $10 \times 11.1/5.3 \approx 21$ matches. Third, the
prediction of $\Delta G_2$ always has intrinsic uncertainties due
to the unavoidable random effects in the second half of
the season, i.e. $V_2/\mathrm{Var}(S_2) > 0$.
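
Eq. (7) can be evaluated directly from the numbers quoted in the text. The sketch below does this for the chances for goals, for the goals, and for the hypothetical case of a perfectly known first-half strength. The function and variable names are illustrative assumptions, and because the inputs are rounded the results only approximately reproduce the quoted correlations.

```python
from math import sqrt


def predicted_correlation(noise_ratio_first, n1, noise_ratio_second,
                          n2=17, corr_s=1.0):
    """Eq. (7), written in terms of the noise ratios V / Var(S)."""
    return corr_s / (sqrt(1 + noise_ratio_first / n1)
                     * sqrt(1 + noise_ratio_second / n2))


RATIO_CHANCES = 14.2 / 2.66   # V_1 / Var(S_1), about 5.3
RATIO_GOALS_1 = 11.1          # quoted ratio when Y is based on goals
RATIO_GOALS_2 = 2.95 / 0.21   # V_2 / Var(S_2) for the second half

corr_chances = predicted_correlation(RATIO_CHANCES, 17, RATIO_GOALS_2)
corr_goals = predicted_correlation(RATIO_GOALS_1, 17, RATIO_GOALS_2)

# Optimum limit: no random contribution in the first half (V_1 = 0).
corr_optimum = 1 / sqrt(1 + RATIO_GOALS_2 / 17)

# Number of matches of goal data carrying the same information
# as 10 matches of chances-for-goals data.
equivalent_goal_matches = 10 * RATIO_GOALS_1 / RATIO_CHANCES

print(corr_chances, corr_goals, corr_optimum, equivalent_goal_matches)
```

The equivalence of 10 matches of chances with about 21 matches of goals follows because the first-half noise enters Eq. (7) only through the combination $V_1/(N_1 \mathrm{Var}(S_1))$.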

Eq. (7) allows one to define the limit of optimum prediction.
In this case $Y$ would be explicitly given by $S_1$, i.e.
$V_1 = 0$. This yields $\mathrm{corr}(Y, Z = \Delta G_2) = 0.73$. This
shows that taking the chances for
goals ($\mathrm{corr}(Y = \Delta C_1, Z = \Delta G_2) = 0.65$) rather than
the goals ($\mathrm{corr}(Y = \Delta G_1, Z = \Delta G_2) = 0.56$) is indeed a
major improvement relative to this optimum limit.

TABLE I: The different systematic and random contributions
of the observables, relevant for this work.



D. Going beyond the present season 

Naturally, the prediction quality can be further improved
by incorporating information from the previous
season about the team strength. This additional variable
is denoted as $X$. Here we consider the chances for goals
of the previous season, which we denote $X = \Delta C_0$. One
obtains $\mathrm{corr}(\Delta C_0, \Delta G_2) = 0.56$. In principle one can
again analyse the systematic and random contributions
of $\Delta C_0(N_0)$. The corresponding $N_0$-dependent variance
reads (see Eq. (6))

$$\mathrm{Var}(\Delta C_0(N_0)) = \mathrm{Var}(S_0) + \frac{V_0}{N_0} \tag{8}$$

with $\mathrm{Var}(S_0) = 2.32$ and $V_0 = 14.1$. For comparison,
all relevant statistical parameters are summarized
in Tab. I.

Of course, both values are close to $\mathrm{Var}(S_1)$ and $V_1$.
The small differences express the fact that the statistical
properties of the first and the second half of
the season are slightly different [20]. Using the same
reasoning as in the context of Eq. (7) one finally obtains
$\mathrm{corr}(S_0, S_2) = 0.88$ and $\mathrm{corr}(S_0, S_1) = 0.86$. Both values
are identical within statistical errors. This is compatible
with the observation that the team strength does not
vary within a season. The fact that both values are significantly
smaller than unity shows, however, that there
is a small but significant variation of the team strength
between two seasons. For future purposes we use the
average value $\mathrm{corr}(S_0, S_{1,2}) = 0.87$ for the characterization
of the correlation of the team strength between
two seasons.



III. QUALITY OF THE REGRESSION 
PROCEDURE 

A. General information content 

For small $N_1$, i.e. at the beginning of the season, the
information content about the strength of a team is quite
limited. Therefore it is essential to also incorporate team
information which is already available at the beginning
of the tournament, i.e. which reflects the strength of this team
in the past season. Thus, before the first match the
prediction is fully based on $X$, and gradually, with an increasing
number of matches, the variable $Y$ contains more
and more information about the present team strength
and thus gains a stronger statistical weight in the
inference process. This setup is sketched in Fig. 3. As
discussed above we choose for $X$ the chances for goals of
the previous season. The general relations, however, also
hold beyond this specific choice.

FIG. 3: Schematic representation of the general prediction
setup.

Interestingly, the quality of the multivariate prediction
can be expressed in analogy to Eq. (2) and reads

$$\chi^2(X,Y) = \chi^2(Y)\left[1 - \left[\mathrm{corr}(X - Y, Z - Y)\right]^2\right] \tag{9}$$

where the partial correlation coefficient

$$\mathrm{corr}(X - Y, Z - Y) = \frac{\mathrm{corr}(X,Z) - \mathrm{corr}(X,Y)\,\mathrm{corr}(Y,Z)}{\sqrt{1 - \mathrm{corr}(X,Y)^2}\,\sqrt{1 - \mathrm{corr}(Y,Z)^2}} \tag{10}$$

has been used. $\chi^2(Y)$ has already been defined in Eq. (2).
The second factor on the right-hand side of Eq. (9) explicitly
contains the additional information of the variable $X$
as compared to $Y$. One can easily show that, in agreement
with expectation, Eq. (9) is completely symmetric in $X$ and
$Y$. Since Eq. (9) is non-standard it is explicitly derived in
Appendix II via some general arguments.
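
The relation between Eqs. (2), (9) and (10) can be checked numerically. The following sketch, which uses purely synthetic data and hypothetical helper names, compares the residual of an explicit two-variable least-squares fit $Z = aX + bY$ with the value obtained from the partial correlation coefficient. All variables are centered first, in line with the zero-mean convention used in the text.

```python
import random
from math import sqrt


def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def moment(a, b):
    """Average of the product a*b over all entries (inputs are centered)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def corr(a, b):
    return moment(a, b) / sqrt(moment(a, a) * moment(b, b))


def partial_corr(x, y, z):
    """Eq. (10): correlation between x and z with the influence of y removed."""
    rxz, rxy, ryz = corr(x, z), corr(x, y), corr(y, z)
    return (rxz - rxy * ryz) / (sqrt(1 - rxy ** 2) * sqrt(1 - ryz ** 2))


def two_variable_residual(x, y, z):
    """Mean squared residual of the least-squares fit z = a*x + b*y."""
    sxx, syy, sxy = moment(x, x), moment(y, y), moment(x, y)
    sxz, syz = moment(x, z), moment(y, z)
    det = sxx * syy - sxy ** 2
    a = (sxz * syy - syz * sxy) / det
    b = (syz * sxx - sxz * sxy) / det
    return moment(z, z) - a * sxz - b * syz


# Synthetic data: a hidden "strength" observed three times with noise.
rng = random.Random(0)
strength = [rng.gauss(0, 1) for _ in range(500)]
x = centered([0.9 * s + rng.gauss(0, 0.8) for s in strength])
y = centered([s + rng.gauss(0, 0.6) for s in strength])
z = centered([s + rng.gauss(0, 0.9) for s in strength])

chi2_y = moment(z, z) * (1 - corr(y, z) ** 2)                  # Eq. (2)
chi2_xy_formula = chi2_y * (1 - partial_corr(x, y, z) ** 2)    # Eq. (9)
chi2_xy_direct = two_variable_residual(x, y, z)
print(chi2_y, chi2_xy_formula, chi2_xy_direct)
```

For sample moments the two values of $\chi^2(X,Y)$ agree up to rounding errors, and both are never larger than $\chi^2(Y)$.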

B. Estimation of the team strength 

So far, we have identified $Z$ with the goal difference
in the second half of the season, which is composed of $S_2$
and the non-predictable random effects as expressed by
$\mathrm{Var}(\Delta G_2) = \mathrm{Var}(S_2) + V_2/17$. Now we define

$$\tilde{\chi}^2(X,Y) \equiv \chi^2(X,Y) - V_2/17. \tag{11}$$

This can be interpreted as the statistical error for the
prediction of the individual team strengths. In case of a
perfect estimation of the team strengths one would have
$\tilde{\chi}^2(X,Y) = 0$. Mathematically this result can be derived
by choosing $Z = S_2$ rather than $Z = \Delta G_2$ in Eq. (9). After
some straightforward algebraic manipulations
of Eq. (9) one directly obtains $\tilde{\chi}^2(X,Y)$.

FIG. 4: The prediction quality of the team strength, determined
via $\sqrt{\tilde{\chi}^2(X,Y)}$, is shown as a function of the number of
match days $N_1$. Different choices of variables are shown. The
solid lines are based on the explicit formulas for the prediction
quality.



IV. RESULTS 

A. Numerical results 

For each value of $N_1$ we have performed a multivariate
regression analysis, yielding $\chi^2(X,Y)$, and finally subtracted
$V_2/17$. As before we have chosen several subsets
of $N_1$ matches from the first half of the season to decrease
the statistical error. We proceed in two steps.
First, we neglect the contribution of $X$, i.e. the information
from the previous season. The results are shown in
Fig. 4. One can see that (trivially) for $N_1 = 0$ the standard
deviation in the estimation of the team strength is
identical to the standard deviation of the $S_2$-distribution.
The longer the season, the more information is available
to distinguish between stronger and weaker teams. Using
the information of the complete first half of the season
($N_1 = 17$) the statistical uncertainty decreases to 0.22.
Here one can explicitly see the advantage of using the
chances for goals rather than the goals themselves. Repeating
the same analysis with the number of goals one
would have an uncertainty of 0.30 after $N_1 = 17$ matches,
which is significantly higher than the value of 0.22 reported
above. Second, when additionally incorporating
the information from $X$, the statistical uncertainty is already
quite small at the beginning of the season (0.3). Of
course, during the course of the season it becomes even
smaller. Even after 17 matches the additional gain of
using $X$ is significant (0.22 vs. 0.19).



B. Analytical results 



$\tilde{\chi}^2(X,Y)$ can also be calculated analytically by incorporating
the statistical properties of the variables
$X$, $Y$, and $Z$. For future purposes we abbreviate $d = V_1/\mathrm{Var}(S_1)$.
First, we have (using $\mathrm{corr}(S_1, S_2) = 1$)

$$\mathrm{corr}(Y = \Delta C_1(N_1), S_2) = \frac{1}{\sqrt{1 + d/N_1}}. \tag{12}$$

Furthermore, we express $\mathrm{corr}(X, S_2)$ as

$$\mathrm{corr}(X = \Delta C_0, S_2) = \frac{\mathrm{corr}(S_0, S_{1,2})}{\sqrt{1 + V_0/(17\,\mathrm{Var}(S_0))}} \equiv c. \tag{13}$$

In analogy one obtains

$$\mathrm{corr}(X = \Delta C_0, Y = \Delta C_1(N_1)) = \frac{c}{\sqrt{1 + d/N_1}}. \tag{14}$$

In summary, all information is contained in
the two constants $c$ and $d$. A straightforward
calculation yields $\mathrm{corr}(X - Y, Z - Y) =
c\sqrt{1 - 1/(1 + d/N_1)} / \sqrt{1 - c^2/(1 + d/N_1)}$. Finally,
one ends up with

$$\tilde{\chi}^2(X,Y) = \mathrm{Var}(S_2)\,\frac{d/N_1}{1 + d/N_1}\cdot\frac{1 - c^2}{1 - c^2/(1 + d/N_1)}. \tag{15}$$



Now we can compare the actual uncertainty, as already
shown in Fig. 4, with the theoretical expectation, as expressed
by the analytical result Eq. (15). The results are
included in Fig. 4. To reproduce the case without the
variable $X$ one can simply choose $c = 0$. One can see a
very close agreement with the actual data.
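
To make the analytical result concrete, the sketch below evaluates Eq. (15) with the parameter values quoted in the text, together with the total half-season uncertainty $17\sqrt{\tilde{\chi}^2(X,Y) + V_2/17}$ that is used in the Discussion. The function names are hypothetical, and since the inputs are rounded the output only approximately reproduces the quoted numbers (0.22, 0.19, 0.3, 7.8 and 7.1).

```python
from math import sqrt

VAR_S2, V2 = 0.21, 2.95
D = 14.2 / 2.66                               # d = V_1 / Var(S_1)
C = 0.87 / sqrt(1 + 14.1 / (17 * 2.32))       # c from Eq. (13)


def strength_error_sq(n1, c=C, d=D, var_s2=VAR_S2):
    """Eq. (15), written with q = 1 / (1 + d/n1) = n1 / (n1 + d)."""
    q = n1 / (n1 + d)
    return var_s2 * (1 - q) * (1 - c ** 2) / (1 - c ** 2 * q)


def half_season_uncertainty(n1, c=C, n2=17, v2=V2):
    """Uncertainty of the summed goal difference of the second half."""
    return n2 * sqrt(strength_error_sq(n1, c=c) + v2 / n2)


err_y_only = sqrt(strength_error_sq(17, c=0.0))   # first half only
err_xy = sqrt(strength_error_sq(17))              # plus previous season
err_start = sqrt(strength_error_sq(0))            # before the first match

total_xy = half_season_uncertainty(17)
total_limit = sqrt(17 * V2)                       # perfect strength estimate

print(err_y_only, err_xy, err_start, total_xy, total_limit)
```

Writing the formula in terms of $q = N_1/(N_1 + d)$ avoids a division by zero at $N_1 = 0$ and makes the two limits visible: $q \to 0$ leaves only the information from $X$, whereas $q \to 1$ removes the statistical error entirely.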

Is this good agreement to be expected? Actually, our
analysis contains just two approximations. First, we have
chosen $\mathrm{corr}(S_0, S_1) = \mathrm{corr}(S_0, S_2)$, which indeed holds
very well (see above). Second, we have assumed that the
team strength does not vary during the first half of the
season. As shown in Ref. [20], the team strength fluctuates
with a small amplitude of approx. $A = 0.17$ and
with a decorrelation time of approx. 7 matches. Since
we average over different choices of $N_1$ matches and, furthermore,
restrict ourselves to the prediction of the total
second half, these temporal fluctuations are to a large extent
averaged out.



V. DISCUSSION

The main goal was (i) to analyse the information content
of different observables and (ii) to better understand
the limits of the prediction of soccer matches. The prediction
quality could be captured by the two parameters $c$
and $d$. One can easily see that the theoretical expression
for the prediction quality, Eq. (15), approaches the limit of
perfect prediction in two cases. (i) For $c = 1$ or $d = 0$ the
information from either the previous or the present
season, respectively, perfectly reflects the present team
strength. (ii) For $N_1 \to \infty$ all random effects have averaged
out so that only the systematic effects remain.

This result can be easily generalized. For example, one
can show for the German Bundesliga that the market
value, determined before the season, is highly informative
for the expected outcome. Taking an appropriately chosen
linear combination of different observables one may
slightly increase the value of $c$ while keeping the general
structure of Eq. (15) unchanged.

The same analysis could also have been performed by
predicting points rather than goal differences. Both observables
are linearly correlated via the simple relation
$P_2 = 0.61 S_2 + 23$. In analogy to $S_2$ the value of $P_2$ denotes
the expected number of points which a team gains
in a match against an average team of the league in a
neutral stadium. Thus, an average team ($S_2 = 0$) on
average gains 23 points per half-season.

One interesting question arises: is the residual statistical
error of $S_2$ for $N_1 = 17$ small or large? This question
may be discussed in two different scenarios. First, one
may want to predict the outcome of the second half of the
league. In the present context the uncertainty is given
by $17\sqrt{\chi^2(X,Y)} = 17\sqrt{\tilde{\chi}^2(X,Y) + V_2/17}$. These values
are plotted for different prediction scenarios in Fig. 5.
One can see how the additional information decreases the
uncertainty of the prediction. Most importantly, the no
man's land below an uncertainty of $\sqrt{17 V_2} = 7.1$ cannot
be reached by any type of prediction. The art of approaching
this perfect prediction thus amounts to decreasing
the present value of 7.8 to a value closer to 7.1. Second,
one may be interested in the prediction of a single match.
This case is somewhat different. Since the fluctuations of the
team strength are very difficult to predict, the fluctuation amplitude
$A = 0.17$ [20] serves as a scale for estimating the
quality of prediction. If the uncertainty is much smaller
than $A$, any further improvement would not help. In the
present case the statistical error is close to $A$, so that a
further reduction of $\tilde{\chi}^2(X,Y)$ would still be relevant for
prediction purposes.

Note that the chances for goals are not a completely
objective observable because the subjective
judgement of a sports journalist may influence the
estimates. In this sense the high information content of
chances for goals indicates that the subjective component
is quite small and the general definition is very reasonable.
Of course, in the future one may look for strictly
objective match observables taken by companies such as
Opta and Impire to further improve the information content.

We gratefully acknowledge helpful discussions with D.
Riedl, B. Strauss, and J. Smiatek.



FIG. 5: The uncertainty of the prediction of the goal difference
of the second half when using the complete information
of the first half ($N_1 = 17$). Different choices of variables are
shown. Furthermore, the limit of perfect predictability is indicated.



VI. APPENDIX I



Here we consider a simple example of a fictitious coin-tossing
tournament where heads appears with probability
$p$, which in this simple example is given by 1/2. A
team is allowed to toss the coin $M$ times per round. In
the first round this results in $g_1$ heads.
Thus, in the first round one has observed the number of
tosses $M$ as well as the number of heads $g_1$. In
relation to soccer, $M$ would correspond to the number of
chances for goals and $g_1$ to the number of goals in that
match. In order to keep the argument simple we assume
that $M$ is a constant, whereas in a real soccer match $M$
can naturally vary. How can one predict the expected number
of goals $g_2$ in the next round? Here we consider two
different approaches. (1) The prediction is based on the
achievement of the first round, i.e. on the value of $g_1$.
Then the best prediction is $g_2 = g_1$. The variance of the
statistical error of the prediction can be written
as $\sum_{g_1, g_2} p(g_1)\,p(g_2)\,(g_1 - g_2)^2$, where $p(g)$ is the binomial
distribution. A straightforward calculation yields for this
variance a value of $2Mp(1 - p)$. (2) The prediction is
based on the knowledge of the number of tossing attempts. The optimum
prediction is, of course, $pM$. The variance of the
statistical error is given by the binomial distribution, i.e.
by $Mp(1 - p)$. Stated differently, knowing the number
of attempts to reach a specific goal (here tossing heads)
is more informative than the actual number of successful
outcomes as long as the probability $p$ is well known.
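
The two error variances of the coin-tossing example can be verified by summing over the binomial distribution explicitly. In the sketch below the value $M = 6$ and the function names are arbitrary illustrative choices.

```python
from math import comb


def binomial_pmf(g, m, p):
    """Probability of g heads in m tosses."""
    return comb(m, g) * p ** g * (1 - p) ** (m - g)


def error_variance_from_heads(m, p):
    """Approach (1): predict g2 by g1 and average (g1 - g2)^2."""
    pmf = [binomial_pmf(g, m, p) for g in range(m + 1)]
    return sum(pmf[g1] * pmf[g2] * (g1 - g2) ** 2
               for g1 in range(m + 1) for g2 in range(m + 1))


def error_variance_from_attempts(m, p):
    """Approach (2): predict g2 by p*m and average (g2 - p*m)^2."""
    return sum(binomial_pmf(g, m, p) * (g - p * m) ** 2
               for g in range(m + 1))


M, P = 6, 0.5
var_heads = error_variance_from_heads(M, P)        # expected 2*M*p*(1-p)
var_attempts = error_variance_from_attempts(M, P)  # expected M*p*(1-p)
print(var_heads, var_attempts)
```

For these inputs the first approach has exactly twice the error variance of the second, in agreement with $2Mp(1-p)$ versus $Mp(1-p)$.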
VII. APPENDIX II

Here we show a simple derivation of the chosen form
of $\chi^2(X,Y)$. Let $d_{YZ}$ denote the solution of the regression
problem $Z = dY$. Accordingly, $d_{YX}$ is the solution
of the regression problem $X = dY$. In a next
step one defines the new variables $\tilde{Z} = Z - d_{YZ} Y$ and
$\tilde{X} = X - d_{YX} Y$. For these new variables the correlation
with $Y$ is explicitly taken out. A straightforward
calculation shows that the Pearson correlation coefficient
$\mathrm{corr}(\tilde{X}, \tilde{Z})$ is exactly given by the partial correlation
coefficient $\mathrm{corr}(X - Y, Z - Y)$.

Now we consider the regression problem of interest, $Z =
aX + bY$. In a first step it is formally rewritten as

$$Z - d_{YZ} Y = a(X - d_{YX} Y) + (b - d_{YZ} + a\,d_{YX})\,Y. \tag{16}$$

Using the above notation and introducing the new regression
parameter $\tilde{b}$ we abbreviate this relation via

$$\tilde{Z} = a\tilde{X} + \tilde{b}\,Y. \tag{17}$$

By construction the observable $Y$ is uncorrelated with $\tilde{X}$
and $\tilde{Z}$. Therefore the independent variable $Y$ does not
play any role for the prediction of $\tilde{Z}$, so that effectively one
just has a single-variable regression problem. Therefore
one can immediately write

$$\chi^2(X,Y) = \mathrm{Var}(\tilde{Z})\left[1 - \left[\mathrm{corr}(\tilde{X}, \tilde{Z})\right]^2\right]. \tag{18}$$

The first factor is identical to $\chi^2(Y)$, whereas the Pearson
correlation coefficient in the second factor is identical to
$\mathrm{corr}(X - Y, Z - Y)$. This concludes the derivation of
$\chi^2(X,Y)$.